Hierarchical Clustering#

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters (not just a partition of the sample into disjoint sets, but a system of nested partitions).

There are two strategies for hierarchical clustering:

  • Divisive, or top-down, algorithms split the sample into smaller and smaller clusters

  • Agglomerative, or bottom-up, algorithms combine objects into larger and larger clusters

../_images/aglo_and_divisive.jpeg

The agglomerative approach is more popular and effective. The algorithm itself looks pretty simple and consists of the following steps:

  • Initialization: Create as many clusters as there are objects, each object in its own separate cluster.

  • Repeat: Do the following steps until the stopping criterion is met

    • Compute distances: Calculate the pairwise distance or similarity between each pair of clusters.

    • Merging: Combine the two closest clusters into a new cluster.
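The steps above can be sketched directly in Python. Below is a minimal, naive version assuming single linkage, Euclidean distance, and a fixed target number of clusters as the stopping criterion (the function name `naive_agglomerative` and these choices are illustrative, not prescribed by the text):

```python
import numpy as np

def naive_agglomerative(X, n_clusters):
    # Initialization: each object starts in its own separate cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:  # stopping criterion: fixed cluster count
        # Compute distances: single-linkage distance between every pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Merging: combine the two closest clusters into one
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_agglomerative(X, 2))  # [[0, 1], [2, 3]]
```

The two nested loops over cluster pairs are exactly the $\mathcal{O}(N^2)$ distance step performed on each of the up to $N-1$ iterations, which is where the naive $\mathcal{O}(N^3)$ complexity comes from.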

from jupyterquiz import display_quiz
q_demo_seq = [{
    "question": "Imagine we run this algorithm until we have one big cluster that includes all $N$ samples. Using the naive approach, what is the number of operations required for this?",
    "type": "multiple_choice",
       "answers": [
        {"answer": "$\\mathcal{O}(N^2)$", "correct": False, "feedback": "Oops! It's not correct! Try again."}, 
        {"answer": "$\\mathcal{O}(N^3)$", "correct": True, "feedback": "The number of iterations is at most $N-1$ (each iteration merges two clusters and reduces the total number of clusters by one). At each iteration we compute the pairwise distances between all clusters and find the closest pair, which requires $\\mathcal{O}(N^2)$ operations. Therefore, the worst-case time complexity of the naive agglomerative hierarchical clustering algorithm is $\\mathcal{O}(N \\cdot N^2) = \\mathcal{O}(N^3)$."}, 
        {"answer": "$\\mathcal{O}(N)$", "correct": False, "feedback": "Oops! It's not correct! Try again."}, 
        {"answer": "$\\mathcal{O}(N^2 \\log N)$", "correct": False, "feedback": "Oops! It's not correct! Try again."}]}]
display_quiz(q_demo_seq)

Linkage and Distance#

At first, each object is considered as a separate cluster. For single-element clusters, the distance function is naturally defined:

\(R(\{x\},\{y\})=\rho (x, y)\)
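In code, the base distance \(\rho\) can be computed with any metric; here is a quick sketch using SciPy's distance functions (the choice of vectors is purely illustrative):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

print(distance.euclidean(x, y))   # Euclidean: sqrt(2) ≈ 1.414
print(distance.cityblock(x, y))   # Manhattan: 2.0
print(distance.cosine(x, y))      # cosine distance 1 - cos(θ): 1.0
```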

The distance \(\rho\) between objects can be specified by any metric: Euclidean or Manhattan distance or, for example, the cosine measure. Then the process of cluster merging starts. At each iteration, the pair of closest clusters \(U\) and \(V\) is replaced by a new cluster \(W = U \cup V\). There are several methods of identifying the closest clusters:

  • Single linkage aka Nearest Point Algorithm: The distance between clusters is defined by the distance between their closest members.

../_images/single.jpeg
\[\hspace{2mm} R_{min}(U, V)=\min\limits_{u \in U, v \in V}\rho (u, v)\]
  • Complete linkage aka Farthest Point Algorithm: The distance between clusters is defined by the distance between their farthest members.

../_images/complete.jpeg
\[\hspace{2mm} R_{max}(U, V)=\max\limits_{u \in U, v \in V}\rho (u, v)\]
  • Unweighted Pair Group Method with Arithmetic mean (UPGMA): Also known as average linkage, this method looks at the distances between all pairs of points from the two clusters and averages them.

../_images/avg.jpeg
\[\hspace{2mm} R_{avg}(U, V)=\dfrac{1}{|U|\cdot|V|}\sum\limits_{u \in U}\sum\limits_{v \in V}\rho (u, v)\]
  • Unweighted Pair Group Method with Centroid average (UPGMC): A centroid (the mean of all points) is computed for each cluster, and the distance between clusters is the (squared) distance between their respective centroids.

../_images/center.jpeg
\[\hspace{2mm} R_{cen}(U, V)=\rho^2(\sum\limits_{u \in U}\dfrac{u}{|U|}, \sum\limits_{v \in V}\dfrac{v}{|V|})\]
  • Ward’s method: It is based on the increase in the error sum of squares (ESS) that results from merging two clusters, and it aims to minimize the variance within the newly formed cluster. The idea is to choose the pair of clusters whose merging leads to the smallest increase in the total within-cluster sum of squares.

../_images/ward.jpeg
\[\hspace{2mm} R_{ward}(U, V)=\dfrac{|U|\cdot|V|}{|U|+|V|}\rho^2(\sum\limits_{u \in U}\dfrac{u}{|U|}, \sum\limits_{v \in V}\dfrac{v}{|V|})\]
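The different linkage criteria above can be compared on a toy pair of clusters; a small NumPy sketch (the data is made up for illustration):

```python
import numpy as np

U = np.array([[0.0, 0.0], [0.0, 1.0]])
V = np.array([[3.0, 0.0], [4.0, 0.0]])

# all pairwise distances rho(u, v) between members of U and V
D = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=-1)

print(D.min())   # single linkage R_min: 3.0
print(D.max())   # complete linkage R_max
print(D.mean())  # average linkage R_avg (UPGMA)

# centroid-based criteria
cU, cV = U.mean(axis=0), V.mean(axis=0)
r_cen = np.sum((cU - cV) ** 2)                        # UPGMC: squared centroid distance
r_ward = len(U) * len(V) / (len(U) + len(V)) * r_cen  # Ward's criterion
print(r_cen, r_ward)  # 12.5 12.5
```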



At each step, it is good to be able to quickly calculate the distance from the formed cluster \(W = U \cup V\) to any other cluster \(S\), using known distances from previous steps. This is easily accomplished using the equation proposed by Lance and Williams in 1967:

\[R(W,S)=\alpha_{U}\cdot R(U,S)+\alpha_{V}\cdot R(V,S)+\beta\cdot R(U,V)+\gamma\cdot|R(U,S)-R(V,S)|\]

where \(\alpha_{U}, \alpha_{V}, \beta, \gamma\) are parameters that depend on the linkage method.
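For example, single linkage corresponds to \(\alpha_U=\alpha_V=\tfrac12\), \(\beta=0\), \(\gamma=-\tfrac12\), which collapses the recurrence to \(\min(R(U,S), R(V,S))\). A quick sketch checking this (the helper name `lance_williams` is made up):

```python
def lance_williams(r_us, r_vs, r_uv, a_u, a_v, beta, gamma):
    # the Lance-Williams recurrence for R(W, S) with W = U ∪ V
    return a_u * r_us + a_v * r_vs + beta * r_uv + gamma * abs(r_us - r_vs)

# single-linkage coefficients reduce the update to min(R(U,S), R(V,S))
r = lance_williams(r_us=2.0, r_vs=5.0, r_uv=1.0,
                   a_u=0.5, a_v=0.5, beta=0.0, gamma=-0.5)
print(r)  # 2.0 == min(2.0, 5.0)
```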

Pros and Cons#

There is no definitive answer to the question of which linkage is better. Each of the distances listed above has its own disadvantages and is not suitable for every task.

  • The single linkage method suffers from a chaining effect: regardless of the shape of a cluster, the objects closest to its boundary keep being added to it. On the other hand, this makes the nearest neighbor method well suited for identifying ribbon-shaped clusters (like moons).

  • The complete linkage method does not have a chain effect, but at an early stage it can unite rather dissimilar groups.

  • The UPGMC method is quite intuitive and seems to be the “golden mean”.

  • Ward’s method turned out to be the best in experimental comparisons on a representative set of model problems: it recovers the intuitively best clustering more often than the other methods.

Dendrogram#

Each iteration of the algorithm corresponds to a pair of clusters merged at that iteration, together with the distance between the clusters at the moment of merging. These distances only increase from iteration to iteration, so it becomes possible to visualize the result as a beautiful cluster tree called a dendrogram.

On the diagram above, the synthetic data is shown on the left, and the dendrogram built from this data is on the right. On the dendrogram, the sample objects are marked along the horizontal axis, and the distances between clusters are plotted vertically. Cluster merges correspond to horizontal lines.

We can set a linkage distance threshold at or above which clusters will not be merged. Play with the slider to see how the clusters change depending on the threshold value.
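A dendrogram can be drawn with SciPy; the sketch below uses two synthetic blobs, and the data, seed, and threshold value of 5 are all arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")      # each row: merged pair, merge distance, new size
dendrogram(Z, color_threshold=5)   # horizontal lines mark merges
plt.ylabel("merge distance")
plt.show()

# cut the tree at a distance threshold instead of fixing the cluster count
labels = fcluster(Z, t=5, criterion="distance")
print(len(np.unique(labels)))  # 2 clusters at this threshold
```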

Number of clusters

To determine the number of clusters, the interval of maximum length \(|R_{t+1}-R_{t}|\) between successive merge distances is found. The final clusters are those obtained at step \(t\); in this case, the number of clusters is equal to \(m-t+1\), where \(m\) is the total number of merge steps. However, when the number of clusters is unknown in advance and there are not very many objects in the sample, it can be useful to look at the full dendrogram.
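This largest-gap rule can be read straight off SciPy's linkage matrix, whose third column stores the merge distances \(R_t\). A sketch on three well-separated synthetic blobs (data and seed are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 4, 8)])

Z = linkage(X, method="ward")
dists = Z[:, 2]                # R_t: merge distance at step t (0-based)
t = np.argmax(np.diff(dists))  # step just before the biggest jump |R_{t+1} - R_t|
n_clusters = len(X) - 1 - t    # clusters remaining after performing step t
print(n_clusters)  # 3
```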

Iris Dataset#

Let’s test Ward’s method on some real dataset.

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import AgglomerativeClustering

df = px.data.iris()
X = df[["sepal_length", "sepal_width", "petal_width"]]

# Ward-linkage agglomerative clustering via scikit-learn
def predict_clusters(X, n_clusters):
    return AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X)

clusters = predict_clusters(X, n_clusters=3)

fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'scene'}, {'type': 'scene'}]])

fig.add_trace(
    go.Scatter3d(
        x=df["sepal_length"],
        y=df["sepal_width"],
        z=df["petal_width"],
        mode='markers',
        marker=dict(
            color=df["species_id"]
        )
    ),
    row=1, col=1,
)

fig.add_trace(
    go.Scatter3d(
        x=df["sepal_length"],
        y=df["sepal_width"],
        z=df["petal_width"],
        mode='markers',
        marker=dict(
            color=clusters
        )
    ),
    row=1, col=2
)

fig.update_traces(marker_size = 5, showlegend=False)

axis_titles = {'xaxis': {'title': {'text': 'sepal_length'}},
               'yaxis': {'title': {'text': 'sepal_width'}},
               'zaxis': {'title': {'text': 'petal_width'}}}
fig.update_layout(scene=axis_titles, scene2=axis_titles)
fig.show()

The plot above shows the clustering result (right) next to the ground-truth species labels (left). The two are quite similar.

display_quiz("#question6")
display_quiz("#question7")